Data Science for Biologists
Associate Professor in Data Science and Genetics at the University of East Anglia.
Academic background in Behavioural Ecology, Genetics, and Insect Pest Control.
Teach Genetics, Programming, and Statistics
One workshop per week
One lecture per week
One assignment per week
One ‘capstone’ project
I hope you end up with more questions than answers!
For Research to be reproducible both data and methods should be available.
Applying the described methods to the data leads to the same results
In theory, method availability ≠ code
But with complex data and analyses - are methods of data collection enough?
Science advances incrementally by identifying and rectifying errors over time
Peer review: Critical evaluation of papers by experts maintain quality
Independent studies either support or fail to replicate findings
Publication bias: preference for positive results
Pressure to publish
Poor study designs and statistical issues
Lack of transparency
The reproducibility crisis emerged when numerous studies, especially in fields like psychology, medicine, and biology, failed to be replicated by other researchers.
High-profile replication attempts revealed that many published results could not be consistently reproduced, raising doubts about their validity.
Recognition that no study should be considered ‘definitive’
Empower lasting systemic change through increased transparency in research methods, data sharing and reporting
Structural change in academic culture
Open science is a global movement that aims to make scientific research and its outcomes freely accessible to everyone. By fostering practices like data sharing and preregistration, open science not only accelerates scientific progress but also strengthens trust in research findings.
UK Reproducibility Network - funded by UK Research Council
46 member institutions (UEA is one)
Establish open research practices across UK Research
/home/phil/Documents/paper
├── abstract.R
├── correlation.png
├── data.csv
├── data2.csv
├── fig1.png
├── figure 2 (copy).png
├── figure.png
├── figure1.png
├── figure10.png
├── partial data.csv
├── script.R
└── script_final.R
README
Documented
Easy to code with
All files are inside the root folder
What do you think are the contents of these files:
data/raw/madrid_minimum-temperature.csv
scripts/02_compute_mean-temperature.R
analysis/01_madrid_minimum-temperature_descriptive-statistics.qmd
Come up with good names for these:
a dataset of cats with columns for weight, length, tail length, fur colour(s), fur type and name.
a script that downloads data from Spotify.
a scripts that cleans up data.
a scripts that fits a linear discriminant model and saves it to a file.
Use projects
Check your code runs on blank slates
Automates the creation of a paper or report
Saves time
Reduces errors
(https://www.nature.com/articles/d41586-022-00563-z)
What is a Statistical Model?
A model is a simplified representation of real-world processes.
It helps us describe, explain, and predict outcomes.
A good fit makes accurate predictions; a poor fit can lead to misleading conclusions.
To make reliable inferences, the model must accurately represent the data.
“Later, we’ll see how models help us test hypotheses using p-values to assess if the data fits our expectations.”
Population:
The entire group you want to study (e.g., all humans, all mice in a lab).
Studying the entire population is often impractical due to time, cost, or logistics.
Sample:
A subset of the population, selected to make inferences about the whole.
Must be representative to ensure accurate conclusions.
Summarize data to highlight key features:
Central Tendency: Where is the “center” of the data? (mean, median, mode)
Spread: How variable are the data? (variance, standard deviation)
Helps us understand the data before making inferences.
The central tendency of a series of observations is a measure of the “middle value”
The three most commonly reported measures of central tendency are the sample mean, median, and mode.
| Mean | Median | Mode |
|---|---|---|
| The average value | The middle value | The most frequent value |
| Sum of the total divided by n | The middle value (if n is odd). The average of the two central values (if n is even) | The most frequent value |
| Most common reported measure, affected by outliers | Less influenced by outliers, improves as n increases | Less common |
One of the simplest statistical models in biology is the mean
| Lecturer | Friends |
|---|---|
| Mark | 5 |
| Tony | 3 |
| Becky | 3 |
| Ellen | 2 |
| Phil | 1 |
| Mean | 2.6 |
| Median | 3 |
| Mode | 3 |
Calculating the mean:
\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \] \[\frac{5 + 3 + 3 + 2 + 1}{5} = 2.6\]
Sum all the values and divide by n of values
One of the simplest statistical models in biology is the mean
| Lecturer | Friends |
|---|---|
| Mark | 5 |
| Tony | 3 |
| Becky | 3 |
| Ellen | 2 |
| Phil | 1 |
| Mean | 2.6 |
| Median | 3 |
| Mode | 3 |
We already know this is a hypothetical value as you can’t have 2.6 friends (I think?)
Now with any model we have to know how well it fits/ how accurate it is
If data is symmetrically distributed, the mean and median will be close, especially as n increases.
This is also know as a normal distribution
Variance:
The average of the squared differences from the mean.
\[s{^2}_{sample} = \frac{\sum(x - \bar{x})^2}{N -1}\] Higher variance = more spread.
Standard Deviation (SD):
The square root of variance.
Easier to interpret because it’s in the same units as the data.
| Lecturer | Friends | Residuals | Sq Resid |
|---|---|---|---|
| Mark | 5 | 2.4 | 5.76 |
| Tony | 3 | 0.4 | 0.16 |
| Becky | 3 | 0.4 | 0.16 |
| Ellen | 2 | -0.6 | 0.36 |
| Phil | 1 | -1.6 | 2.56 |
| Mean | 2.6 |
Sum of Squared Residuals = 9
N-1 = 4
Variance = 9/4 = 2.25
Square root of sample variance
A measure of dispersion of the sample
Smaller SD = more values closer to mean, larger SD = greater data spread from mean
variance:
\[ \sigma = \sqrt{\sum(x - \overline x)^2\over n - 1} \]
For a population the variance \(S_p^2\) is exactly the mean squared distance of the values from the population mean
\[ s{^2}_{pop} = \frac{\sum(x - \bar{x})^2}{N} \]
But this is a biased estimate for the population variance
A biased sample variance will underestimate population variance
n-1 (if you take a large enough sample size, will correct for this)
| Lecturer | Friends | Diff | Squared diff |
|---|---|---|---|
| Mark | 5 | 2.4 | 5.76 |
| Tony | 3 | 0.4 | 0.16 |
| Becky | 3 | 0.4 | 0.16 |
| Ellen | 2 | -0.6 | 0.36 |
| Phil | 1 | -1.6 | 2.56 |
| Mean | 2.6 | ||
| variance | 2.25 | ||
| SD | 1.5 |
Small \(s\) = data points are clustered near the mean
Large \(s\) = data points are widely dispersed around the mean
Shape: Symmetrical, bell-shaped curve
Described by just two parameters: mean (μ) and standard deviation (σ).
Rule of Thumb: For normally distributed data:
~68% within 1 SD
~95% within 2 SDs
~99.7% within 3 SDs
Relevance: Many statistical tests assume data follows a normal distribution. This helps us calculate probabilities—like p-values—to test hypotheses.
If we assume a normal distribution (or close enough), we can calculate the probability of observing any given value using just the mean and standard deviation.
\[ f(x) = \frac{1}{\sqrt{2\pi} \, \sigma} \exp\!\Biggl(-\frac{(x - \mu)^2}{2\,\sigma^2}\Biggr) \]
This has applications in hypothesis testing and building confidence intervals
Histograms plot frequency/density of observations within bins
Quantile-Quantile plots plot quantiles of a dataset vs. quantiles of a theoretical (usually normal) distribution
Problem: Different datasets have different means and standard deviations.
Solution: Standardization allows comparisons by converting any normal distribution into a standard normal distribution (mean = 0, SD = 1).
\[ Z = \frac{X - \mu}{\sigma} \]
Z: How many standard deviations a value is from the mean.
X: Observed value.
μ: Population mean.
σ: Population standard deviation.
The standard normal distribution allows us to calculate probabilities of observing extreme values.
Example: If 𝑍 = 2, the observation is 2 standard deviations above the mean.
Using a Z-table, we can find the probability of getting a result at least this extreme.
This concept extends to p-values, which tell us how rare our data is under the null hypothesis.
A p-value is the probability of obtaining results at least as extreme as the ones we observed, assuming the null hypothesis is true.
It helps quantify how surprising or unusual our data is under the null hypothesis.
A low p-value suggests that the observed data would be rare if the null hypothesis were true, which may lead us to question the null hypothesis
Imagine flipping a coin 100 times, expecting about 50 heads if it’s fair (the null hypothesis). If you observe 90 heads, the p-value tells you how likely it is to get such an extreme result just by chance.
Example: A new drug has no effect compared to a placebo.
Example: The new drug does improve patient outcomes compared to the placebo.
The shaded areas in the tails represent extreme outcomes (typically the most unexpected 5% if using an \(\alpha\) = 0.05).
If your observed data falls within these tails, it’s considered statistically significant, suggesting it’s unlikely to occur by random chance under the null hypothesis.
Key Point: Larger sample sizes reduce variability, making it easier to detect small differences as statistically significant.
However, statistical significance doesn’t always mean practical importance—especially with large samples.
Scenario: Testing if diet A and diet B affect mice longevity.
Null Hypothesis (H₀): No difference in longevity between diets.
Alternative Hypothesis (H₁): There is a difference in longevity.
lm(longevity ~ diet, data = mice)
Explanation: This linear model performs a t-test to assess if the mean lifespans differ significantly between diets.
The t-distribution shows where your observed mean difference falls relative to what’s expected under the null hypothesis.
Confidence Intervals (CI): The red area marks the 95% CI. If zero falls outside this range, it suggests statistical significance.
# A tibble: 2 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 4.19 0.276 15.2 3.87e-13 3.62 4.77
2 dietmice_b -1.41 0.391 -3.60 1.60e- 3 -2.22 -0.595
“Diets can change the longevity of mice (p = 0.0016).”
“Mice on diet B lived significantly shorter lives than mice on diet A (t22 = -3.6, p = 0.006).”
“Mice on diet B had a reduced mean lifespan of 1.41 years[95% CI; -0.595:-2.22] from the mice on diet A (mean 4.19 years (95% CI, 3.62-4.77). While statistically significant (t22 = -3.6, p = 0.006), this is a relatively small sample size, and further testing is recommended to confirm this effect.”
Which has the greatest level of useful detail?
Standard error measures the accuracy of a sample mean as an estimate of the population mean.
\[ SE = {s\over \sqrt n} \]
\(s = \sqrt{\sum(x - \overline x)^2\over n - 1}\)
\(SE(\overline{X}) = \frac{s}{\sqrt(N)}\)
Standard Error of the Mean (\(\overline{X}\)) is the **Standard Deviation of the Sampling Distribution of the Mean.
If you took many samples of size 𝑛from the same population, each sample would yield a different sample mean.
All these sample means form their own distribution: the sampling distribution of the mean.
The Standard Error measures the width (standard deviation) of this distribution of sample means.
Larger 𝑛→ smaller SE → sample means cluster more tightly around the true mean.
For large 𝑛the distribution of sample means becomes approximately normal, no matter the original data’s shape (assuming finite variance & independence).
Lets us use normal-based methods (like confidence intervals and hypothesis tests) even if the original data are not normal—provided 𝑛 is large.
What is a sufficient 𝑛?
When the data is very close to a normal distribution it might be very small c.10
When the data is slightly skewed it might be 𝑛 = 30
When the data follows an alternative distribution it could be very large
If you know the data follows a specific distribution (e.g. Poisson, Binomial, etc) we can base our inference on these - see Generalised Linear Models
Even if two samples are both drawn from the same underlying population, they won’t have identical means.
Question: When is that difference “big enough” to claim they come from different populations?
Measures the spread (uncertainty) of one sample mean around its population mean.
\(Mean~difference = \overline x_1 - \overline x_2\)
\(SED = \sqrt{s_1^2\over n_1}+{s_2^2\over n_2}\)
Standard Error of the difference is calculated from two \(s\) so will always be larger than a single population estimate
Null Hypothesis : The two samples do not differ in their population means
Alternative Hypothesis : The two samples come from populations with different means
Even with identical populations, sample means differ. The t-test checks if that difference is too large to be due to chance.
How different is “different enough”?
When can we determine that two sample means are different enough? To have come from different populations?
Compare the means of two samples
Null hypothesis: there is no difference between the population means
t-statistic: based on difference in sample means, sd & n
For a given sample size and t-value, how unlikely is to have two random samples at least this different IF both samples are drawn from identical/same populations.
Example: A botanist is studying the mass of pollen transported by individual bees from two different hives (Hive A and Hive B).
Hive A: 62 bees, (mean pollen mass = 31.4 mg, variance = 225.0 mg^2)
Hive B: 67 bees, (mean pollen mass = 36.7 mg, variance = 186.0 mg^2)
Is there a significant difference in pollen mass transported per bee for the two hives studied?
Variance \(s^2\) = Sum of squared residuals/ n
Standard deviation (s) = \(\sqrt Variance\)
Calculate the difference in means, and the standard error of the difference in means
Mean Difference = 36.7 - 31.4
Mean Difference = 5.3
\[ SED = \sqrt{s^2_1\over n_1} + {s^2_2\over n_2} \]
\[ SED = \sqrt{225\over 62} + {186\over 67} \]
\[ SED = \sqrt{3.63} + {2.78} \]
\[ SED = 2.53 \]
Mean difference = 5.3
Calculate 95% confidence intervals
Assuming a normal distribution this would be 1.96 * SED = 4.96
Assuming a t distribution = ( \(\alpha\) = 0.975, df = 129-2, SED = 2.53 ) = 5.008
95% CI range = 5.3 [± 5.008] = 0.292 - 10.308
\[ SED = 2.53 \]
Calculate 95% confidence intervals
95% CI range = 5.3 [± 5.008] = 0.292 - 10.308
The mass of pollen transported per bee in Hive B is on average 5.3mg more than Hive A [0.29 - 10.3] (mean [± 95% CI])
The mass of pollen transported per bee in Hive B is on average 5.3mg more than Hive A [0.29 - 10.3] (mean [± 95% CI])
The 95% CI calculated from the t distribution doesn’t include 0 → likely a real difference between the two hive populations.
“At P ≤ 0.05 bees from Hive B transport at least 0.29mg of pollen more per bee than Hive A”
Two Sample t-test
data: hive_a and hive_b
t = 2.09, df = 127, p-value = 0.01152
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.294166 10.30808
We can also produce t-tests (and many other tests besides with the lm() function in R)
A p-value is NOT:
The probability that the null hypothesis is true.
The probability your results occurred “by chance.”
Proof of a meaningful or large effect.
What it IS:
In reality, we can’t know for sure if a true mean difference exists.
For illustration: Assume we could know the true mean difference.
The figure shows:
Grey line: Expected data if the null hypothesis is true.
Black line: Expected data if the alternative hypothesis is true.
A p-value shows how surprising the data are if the null is true.
A low p-value is evidence against the null, not proof of the alternative.
What we can conclude, based on our data, is that we have observed an extreme outcome, that should be considered surprising. But such an outcome is not impossible when the null-hypothesis is true.
If we plot the null model for a very large sample size, we can see that even very small mean differences will be considered ‘surprising’.
However, just because data is surprising, does not mean we need to care about it. It is mainly the verbal label ‘significant’ that causes confusion here – it is perhaps less confusing to think of a ‘significant’ effect as a ‘surprising’ effect.
Difference tests: t-test, ANOVA, ANCOVA
Association tests: Regressions
Basic linear models aim to describe a linear relationship between a:
response (dependent) variable and predictor(s) (independent) variable.
usually by the method of ordinary least squares.
\(Y = \beta0 + \beta_1x1 + \epsilon\)
\(y\) is the predicted value of the response variable
\(\beta0\) is the intercept (value of y when x = 0)
\(\beta_1x1\) is the slope of the regression line
\(\epsilon\) is the error term
Difference tests: t-test, ANOVA, ANCOVA
Call:
lm(formula = longevity ~ type, data = dros_clean)
Residuals:
Min 1Q Median 3Q Max
-31.74 -13.74 0.26 11.44 33.26
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 63.560 3.158 20.130 < 2e-16 ***
typeInseminated 0.520 3.867 0.134 0.893
typeVirgin -15.820 3.867 -4.091 7.75e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 15.79 on 122 degrees of freedom
Multiple R-squared: 0.2051, Adjusted R-squared: 0.1921
F-statistic: 15.74 on 2 and 122 DF, p-value: 8.305e-07
Association tests: Regressions
Call:
lm(formula = longevity ~ thorax, data = dros_clean)
Residuals:
Min 1Q Median 3Q Max
-28.415 -9.961 1.132 9.265 36.812
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -61.05 13.00 -4.695 7.0e-06 ***
thorax 144.33 15.77 9.152 1.5e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 13.6 on 123 degrees of freedom
Multiple R-squared: 0.4051, Adjusted R-squared: 0.4003
F-statistic: 83.76 on 1 and 123 DF, p-value: 1.497e-15
Now we have two (or more) reasonable predictors of longevity:
Call:
lm(formula = longevity ~ thorax + type + sleep, data = dros_clean)
Residuals:
Min 1Q Median 3Q Max
-28.153 -6.836 -2.191 7.196 29.046
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -56.04502 11.17882 -5.013 1.87e-06 ***
thorax 144.43008 13.11616 11.012 < 2e-16 ***
typeInseminated 3.62796 2.77122 1.309 0.193
typeVirgin -13.24603 2.76198 -4.796 4.70e-06 ***
sleep -0.05281 0.06383 -0.827 0.410
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.23 on 120 degrees of freedom
Multiple R-squared: 0.6046, Adjusted R-squared: 0.5914
F-statistic: 45.88 on 4 and 120 DF, p-value: < 2.2e-16
\(Y = \beta0 + \beta_1x1 +\beta_2x2 + \beta_3x3 + \epsilon\)
The line of best fit minimises the sum of the squared distance that each sample point is from the value predicted by the model
This is called the method of
The difference between the ACTUAL value of the observation \(y_i\) and the value that the model predicts \(\hat{y_i}\) at that \(x\) value are the residuals (residual error).
The regression model fitted by Ordinary Least Squares (OLS) will produce the equation for the line that MINIMIZES the sum of squares of the residuals
This produces the line with the smallest total area size for the squares.
\[ SSE = \underset{i=1}{n \atop{\sum}}(y_i - \hat{y_i})^2 \]
SUM OF RESIDUAL ERROR = \((2)+(-1)+(4)+(-3)+(-2)=0\)
SUM OF SQUARES OF RESIDUAL ERROR = \((2^2)+(-1^2)+(4^2)+(-3^2)+(-2^2)=34\)
\(\LARGE{R^2}\) : The proportion of variation in the dependent variable that can be predicted from the independent variable
\[ \LARGE{R^2=1-{Sum~of~Squared~Distances~between~y_i~and~\hat{y_i}\over{Sum~of~Squared~Distances~between~y_i~and~\overline{y}}} } \]
\[ \LARGE{R^2=1-{SSE\over{SST}}} \]
Errors should be normally distributed
Homoscedasticity of the errors
Independent observations
A Linear relationship
OK: residuals appear as normally distributed (p = 0.065).
performance::check_model(dros_model,
check = "homogeneity")
performance::check_heteroscedasticity(dros_model)OK: Error variance appears to be homoscedastic (p = 0.155).
OK: No outliers detected.
- Based on the following method and threshold: cook (0.8).
- For variable: (Whole model)
# A tibble: 5 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -56.0 11.2 -5.01 1.87e- 6 -78.2 -33.9
2 thorax 144. 13.1 11.0 6.45e-20 118. 170.
3 typeInseminated 3.63 2.77 1.31 1.93e- 1 -1.86 9.11
4 typeVirgin -13.2 2.76 -4.80 4.70e- 6 -18.7 -7.78
5 sleep -0.0528 0.0638 -0.827 4.10e- 1 -0.179 0.0736
The effect of thorax is statistically significant and positive (beta = 144.43, 95% CI [118.46, 170.40], t(120) = 11.01, p < .001; Std. beta = 0.64, 95% CI [0.52, 0.75])
The effect of type [Inseminated] is statistically non-significant and positive (beta = 3.63, 95% CI [-1.86, 9.11], t(120) = 1.31, p = 0.193; Std. beta = 0.21, 95% CI [-0.11, 0.52])
The effect of type [Virgin] is statistically significant and negative (beta = -13.25, 95% CI [-18.71, -7.78], t(120) = -4.80, p < .001; Std. beta = -0.75, 95% CI [-1.07, -0.44])
The effect of sleep is statistically non-significant and negative (beta = -0.05, 95% CI [-0.18, 0.07], t(120) = -0.83, p = 0.410; Std. beta = -0.05, 95% CI [-0.16, 0.07])
In this example of a simple linear model, we run the equivalent to a Student’s t-test.
Call:
lm(formula = longevity ~ thorax + type + sleep, data = dros_clean)
Residuals:
Min 1Q Median 3Q Max
-28.153 -6.836 -2.191 7.196 29.046
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -56.04502 11.17882 -5.013 1.87e-06 ***
thorax 144.43008 13.11616 11.012 < 2e-16 ***
typeInseminated 3.62796 2.77122 1.309 0.193
typeVirgin -13.24603 2.76198 -4.796 4.70e-06 ***
sleep -0.05281 0.06383 -0.827 0.410
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.23 on 120 degrees of freedom
Multiple R-squared: 0.6046, Adjusted R-squared: 0.5914
F-statistic: 45.88 on 4 and 120 DF, p-value: < 2.2e-16
Q. What happens when we have more than two groups in our predictor variable? Why shouldn’t we just do lots of t-tests?
ANOVAs use information about variances, but the main goal of analysis is comparison of MEANS (don’t let the name mix you up - more on this later).
ANOVA is an omnibus test – it tests for significant differences between any means in the study
ANOVA is just a special case of the linear model
Null Hypothesis (H0): All means are equal (in other words, all groups are from populations with the same mean)
Alternative Hypothesis (HA): At least two group means are NOT equal (that means that just two could be different, or they could ALL be different)
We have collected data for soil uranium concentrations at three locations on Los Alamos National Lab property: Site A, Site B, and Site C. The data structure is shown here:
A one-way ANOVA can be used to assess whether there is a statistically significant difference in uranium concentration in soil at three locations
What does this one-way/single-factor refer to?
There is a single factor (variable), with at least 3 levels, where we are trying to compare means across the different levels of that factor.
ANOVA does this indirectly by looking at total between and within group variances as a ratio (F).
SST The squared differences between the observed dependent variable \(y_i\) and its overall mean \(\overline{y}\).
SSR The sum of the differences between the predicted value \(\hat{y_i}\) of the dependent variable and its overall mean \(\overline{y}\).
SSE The error is the difference between the observed dependent value \(y_i\) and the predicted value \(\hat{y_i}\).
\[ SSE = \underset{i=1}{n \atop{\sum}}(y_i - \hat{y_i})^2 \]
\[ SSR = \underset{i=1}{n \atop{\sum}}(\hat{y_i} - \overline{y})^2 \]
\[ SST = \underset{i=1}{n \atop{\sum}}(y_i - \overline{y})^2 \]
where:
\(y_i\) = Observed value
\(\hat{y_i}\) = Value estimated by model
\(\overline{y}\) = The Grand Mean
\(SSE + SSR = SST\)
\[ \large F = {SSR / (k-1)\over SSE / (N-k)} = {MSR\over MSE} \]
\(k\) = Total number of groups
\(N\) = numerator degrees of freedom = Total number of observations across all groups
\(N-k\) = denominator degrees of freedom
\(MSR\) = Mean Squares Regression
\(MSE\) = Mean Squares Error
This is a ratio of the between group variance and the the within group variance.
The F-value or ratio of variances, over their respective degrees of freedom will have an F-distribution.
This F distribution is used when we want to compare within and between group variances.
The curve of the distribution depends on the degrees of freedom, and it is always positively skewed
The higher the F-value the greater the signal-to-noise ratio.
For a given value of numerator and denominator degrees of freedom we can look up the probability of observing this ratio under a null hypothesis of identical variances.
If F value is high enough then we might have enough evidence to conclude that samples are likely drawn from populations with different means.
Analysis of Variance Table
Model 1: longevity ~ thorax + sleep
Model 2: longevity ~ thorax + type + sleep
Res.Df RSS Df Sum of Sq F Pr(>F)
1 122 22701
2 120 15125 2 7576.9 30.058 2.617e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Single term deletions
Model:
longevity ~ thorax + type + sleep
Df Sum of Sq RSS AIC F value Pr(>F)
<none> 15125 609.47
thorax 1 15282.8 30407 694.77 121.2556 < 2.2e-16 ***
type 2 7576.9 22701 656.23 30.0578 2.617e-11 ***
sleep 1 86.3 15211 608.18 0.6846 0.4097
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The F-ratio tells us only whether the model fitted to the data (SSR) accounts for more variance than other factors (SSE).
So if the F-ratio is large enough to be statistically significant, then we only know that one or more differences between means are statistically significant.
Further testing is needed!
You focused on a few scientifically sensible comparisons, rather than every possible comparison
You chose these as part of the experimental design before collecting data
Could structure linear model to reflect this?
Typically unplanned comparisons, conducted only after a significant ANOVA result
All combinations checked
Needs correcting for inflated Type 1 error
| Method | Equation | Notes |
|---|---|---|
| Bonferroni/Dunn-Sidak | \(p = {\alpha \over k}\) | Correct critical p for the number of independent tests |
| Holm’s | \(p_{1} < \alpha/(k–1+1) = \alpha/k\) | we start with the smallest p-value (i = 1) and determine whether there is a significant result (i.e. p1 < α/(k–1+1) = α/k. If so we move on to the second test. We continue in this manner until we get a non-significant result |
| Tukey HSD | \(t_{crit} = q * \sqrt{MSE \over N}\) | Essential a t-test, correcting for multiple comparison, q-values can be looked up for test df and number of treatments |
$emmeans
type emmean SE df lower.CL upper.CL
Control 61.3 2.26 120 56.8 65.8
Inseminated 64.9 1.59 120 61.8 68.1
Virgin 48.0 1.59 120 44.9 51.2
Confidence level used: 0.95
$contrasts
contrast estimate SE df t.ratio p.value
Control - Inseminated -3.63 2.77 120 -1.309 0.3929
Control - Virgin 13.25 2.76 120 4.796 <.0001
Inseminated - Virgin 16.87 2.25 120 7.508 <.0001
P value adjustment: tukey method for comparing a family of 3 estimates
$emmeans
type emmean SE df lower.CL upper.CL
Control 61.3 2.26 120 56.8 65.8
Inseminated 64.9 1.59 120 61.8 68.1
Virgin 48.0 1.59 120 44.9 51.2
Confidence level used: 0.95
$contrasts
contrast estimate SE df lower.CL upper.CL
Control - Inseminated -3.63 2.77 120 -10.20 2.95
Control - Virgin 13.25 2.76 120 6.69 19.80
Inseminated - Virgin 16.87 2.25 120 11.54 22.21
Confidence level used: 0.95
Conf-level adjustment: tukey method for comparing a family of 3 estimates
Hypotheses in research predict relationships between variables.
Linear models help test these hypotheses by quantifying these relationships.
Converting hypotheses into linear models involves identifying dependent and independent variables.
Considering and including important covariates
Determining whether interaction terms need to be included
Question: Does the amount of sunlight affect plant growth?
Hypothesis: More sunlight hours increase plant growth
Linear Model: height ~ sunlight_hours
Question: Does the amount of sunlight affect plant growth?
Hypothesis: More sunlight hours increase plant growth
Control variables: Soil quality
Linear Model: height ~ sunlight_hours + soil quality
Question: How do water and nutrient levels affect plant growth?
Hypothesis: Both increased water and nutrient levels lead to increased plant growth.
Control Variables: Temperature
Linear Model: growth ~ water + nutrient + temp_c
Question: Do water and nutrient levels work together to affect plant growth?
Hypothesis: Increasing water and nutrient levels has a combined effect that leads to increased plant growth, greater than the effect of either on their own.
Control Variables: Temperature
Linear Model: growth ~ water + nutrient + temp_c + water:nutrient
Question: Does temperature affect the metabolic rate of lizards?
Hypothesis: Temperature increases the metabolic rate of lizards
Control Variables: Age of lizards
Linear Model: metabolic_rate ~ temp + age
Question: How does diet and exercise affect cholesterol in mice?
Hypothesis: low fat diets and regular exercise decrease cholesterol levels in mice
Control Variables: Genetic strain of mice
Linear Model: cholesterol ~ diet_fat + daily_exercise_hours + strain
Linear Model: cholesterol ~ diet_fat + daily_exercise_hours + strain + diet_fat:daily_exercise
Question: Do immune system responses depend on stress and sleep in humans?
Hypothesis: Negative effects of stress on immune responses are lessened if sleep duration remains high
Control Variables: Age
Linear Model: immune_response ~ sleep_hours + stress + age + sleep_hours:stress
Discovering Statistics - Andy Field
An Introduction to Generalized Linear Models - Dobson & Barnett
An Introduction to Statistical Learning with Applications in R - James, Witten, Hastie & Tibshirani
Mixed Effects Models and Extensions in Ecology with R - Zuur, et al.
Ecological Statistics with contemporary theory and application
The Big Book of R (https://www.bigbookofr.com/)
Writing statistical methods for ecologists
Reporting statistical methods and outcome of statistical analyses in research articles
Design principles for data analysis
Log-transformation and its implications for data analysis.
Effect size, confidence interval and statistical significance: a practical guide for biologists
Misconceptions, Misuses, and Misinterpretations of P Values and Significance Testing
Ten common statistical mistakes to watch out for when writing or reviewing a manuscript.
Why most published research findings are false
Model averaging and muddled multimodel inference
A brief introduction to mixed effects modelling and multi-model inference in ecology
The Practical Alternative to the p Value Is the Correctly Used p Value